A Characterization of Compound Documents on the Web

Authors

  • Eyal de Lara
  • Dan S. Wallach
  • Willy Zwaenepoel
Abstract

Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web’s content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service.

In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are:

  1. Compound documents are in general much larger than current HTML documents.
  2. For large documents, embedded objects and images make up a large part of the documents’ size.
  3. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference.
  4. Compression considerably reduces the size of documents in both formats.

Publication date: 1999